Skip to content

Preserve stereochemistry and explicit aromatic '<-->' bonds in Mermaid↔RDKit converters#1

Merged
YZY-stack merged 1 commit into
mainfrom
codex/review-code-completeness
Jun 1, 2026
Merged

Preserve stereochemistry and explicit aromatic '<-->' bonds in Mermaid↔RDKit converters#1
YZY-stack merged 1 commit into
mainfrom
codex/review-code-completeness

Conversation

@boxuan-zhao
Copy link
Copy Markdown
Collaborator

Motivation

  • Add explicit support for an aromatic bond token and ensure aromaticity is preserved when converting Mermaid graphs to RDKit molecules.
  • Preserve double-bond E/Z and tetrahedral R/S stereochemistry robustly across round trips between RDKit and Mermaid.
  • Use absolute CIP labels for atom IDs during export so handedness is stable and serializable.

Description

  • Added <--> mapping to Chem.BondType.AROMATIC and updated bond regexes to accept the new aromatic token in molecode/markush/mermaid_to_rdkit.py, molecode/molecule/mermaid_to_rdkit.py, and related modules.
  • Postponed stereochemistry assignment until after topology is built and Chem.SanitizeMol is called by collecting stereo records and applying them with new helper methods _assign_chirality_from_ids and _assign_double_bond_stereo which restore CIP R/S and E/Z configurations.
  • Changed atom ID generation in the rdkit_to_mermaid converters to use RDKit's _CIPCode property (absolute R/S) instead of raw ChiralTag values so exported IDs are chemically meaningful and stable.
  • Made the system prompt a raw string to avoid escape parsing changes and added tests/test_molecule_stereochemistry.py with round-trip tests for aromatic bonds, E/Z double bonds, and R/S tetrahedral stereochemistry.

Testing

  • Added and ran pytest tests/test_molecule_stereochemistry.py which exercises aromatic non-kekulized rings, E/Z double-bond stereochemistry, and R/S tetrahedral stereochemistry, and all tests passed.
  • The new tests verify that exported Mermaid graphs contain the <--> aromatic token, ===|E|/===|Z| double-bond stereo markers, and _R/_S CIP atom-id suffixes, and that round-tripped molecules match canonical isomeric SMILES.

Codex Task

Copilot AI review requested due to automatic review settings June 1, 2026 02:10
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR makes the Mermaid ↔ RDKit converters preserve stereochemistry and aromaticity across round trips. It adds a new <--> aromatic bond token, switches chiral atom-ID suffixes from raw ChiralTag values to RDKit's absolute _CIPCode (R/S), and defers double-bond E/Z and tetrahedral R/S assignment until after Chem.SanitizeMol, applying them through new helper methods that search for the chirality that yields the requested CIP code.

Changes:

  • Add <--> aromatic bond mapping + regex support and propagate aromatic flags to atoms/bonds in both molecule and markush parsers.
  • Postpone stereochemistry assignment until after sanitization via new _assign_chirality_from_ids / _assign_double_bond_stereo helpers; emit atom IDs using _CIPCode rather than ChiralTag.
  • Make the system prompt a raw string and add round-trip tests for aromatic, E/Z, and R/S cases.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
molecode/molecule/mermaid_to_rdkit.py Adds <--> bond mapping/regex, defers stereo assignment, adds helpers that recover absolute CIP R/S by trying both chiral tags.
molecode/molecule/rdkit_to_mermaid.py Uses _CIPCode instead of ChiralTag to suffix atom IDs with R/S.
molecode/markush/mermaid_to_rdkit.py Mirrors the molecule parser changes for the markush variant.
molecode/markush/rdkit_to_mermaid.py Mirrors the molecule exporter change to use _CIPCode.
molecode/prompts/molecule_system_prompt.py Switches BASE_INSTRUCTION to a raw string to prevent escape-sequence reinterpretation.
tests/test_molecule_stereochemistry.py New round-trip tests covering aromatic <-->, ===|E|/===|Z|, and _R/_S CIP suffixes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Chem.BondType.AROMATIC: '<-->', #非必要不使用,尽量使用凯酷勒式

P2 Badge Keep the prompt in sync with aromatic bonds

Adding <--> here means mol_to_mermaid(..., kekulize=False) now emits syntax that MOLECULE_SYSTEM_PROMPT still tells the model is invalid: molecode/prompts/molecule_system_prompt.py:223-239 says aromatic bonds must use Kekulé form and that <--> cannot be recognized, and the common-errors table at line 547 still marks it wrong. Any LLM workflow using that prompt to generate or edit non-kekulized Mermaid will be instructed to reject/rewrite valid converter output, so the new explicit-aromatic feature is not usable through the prompted path until the prompt is updated.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@YZY-stack YZY-stack merged commit 96e7a6f into main Jun 1, 2026
1 check passed
YZY-stack added a commit that referenced this pull request Jun 1, 2026
Extends PR #1's stereochemistry fix (previously molecule + markush only) to the
polymer path, which until now silently dropped chirality and double-bond
geometry on round-trip.

- polymer_to_mermaid: emit absolute CIP R/S (_CIPCode) in atom ids instead of
  order-dependent CHI_TETRAHEDRAL_CW/CCW (same bug fixed in the molecule converter).
- mermaid_to_psmiles: parse _R/_S id suffixes and ===|E|/===|Z| markers, then
  restore them after sanitize — chirality via brute-forced CIP matching, and
  double-bond geometry via SetDoubleBondNeighborDirections so SMILES emits / and \.
- tests/test_polymer_stereochemistry.py: PLA R/S, cis/trans polybutadiene,
  combined chiral+E/Z, and plain-polymer regression coverage.

Verified: chiral and E/Z PSMILES now round-trip to canonical-identical SMILES,
and cis is distinguished from trans.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants